Unveiling the Secrets of Wine Quality¶

Bypassing traditional tasting methods, this project employs data analysis to predict the quality of wine based on its chemical features.

3. Data Analysis¶

In [193]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import set_config
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVR

from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LinearRegression

from sklearn import metrics

from scipy.stats import zscore
from scipy.stats import pearsonr
import scipy.stats as stats
import statsmodels.api as sm

3.1 Extract Data: Reading from CSV Files¶

In [113]:
# These csv files are downloaded from the UCI website: 
# http://archive.ics.uci.edu/dataset/186/wine+quality.

df_white_wine = pd.read_csv("raw_data/winequality-white.csv", sep=";")
df_red_wine = pd.read_csv("raw_data/winequality-red.csv",sep=";")
In [114]:
df_white_wine.head()
Out[114]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
In [115]:
df_red_wine.head()
Out[115]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
In [116]:
wine_lists = [df_red_wine, df_white_wine]

df_white_wine.wine_type = "White Wine"
df_red_wine.wine_type = "Red Wine"

def get_wine_str(wine_type_df):
    return getattr(wine_type_df, 'wine_type', "Unknown Wine")

3.2 Transformation¶

Here we want to create QQ Plots to understand if the features follow normal distribution or not.

In [117]:
def check_numeric_columns(wine_type_df):
    """
    Select and return the names of numeric columns from the given DataFrame.

    Parameters:
    - wine_type_df: a DataFrame containing data related to wine types.

    Returns:
    - list of column names corresponding to numeric columns.
    """
    numeric_columns = wine_type_df.select_dtypes(include=[np.number])
    column_names = numeric_columns.columns

    return column_names
In [118]:
def create_qq_plot(wine_type_df, folder):
    """
    Create Q-Q plot for each of the feautres of dataset, display them on one canva
    
    Parameters:
    - wine_type_df: The DataFrame containing data related to wine types.
    """
    wine_type = get_wine_str(wine_type_df)

    # Select numerical columns
    numeric_columns = check_numeric_columns(wine_type_df)

    fig, axs = plt.subplots(3, 4, figsize=(20, 15))  
    axs = axs.flatten()

    # Q-Q plot 
    for i, column in enumerate(numeric_columns):  
        if i <= 11:
            data = wine_type_df[column]
            stats.probplot(data, dist="norm", plot=axs[i])
            axs[i].set_title(column, fontsize = 30)
            axs[i].set_xlabel('')
            axs[i].set_ylabel('')

    title = f"QQ Plots for {wine_type}"
    fig.suptitle(title, fontsize=40, weight='bold')
    
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.savefig(f'figures/{folder}/{title}.png', bbox_inches = 'tight')
    plt.show()
In [119]:
def create_plots(plot_function, saving_folder, wine_list=wine_lists):
    """
    Create plots using a specified plot function for each item in the list od datasets.

    Parameters:
    - plot_function (callable): The function used to generate plots.
    - saving_folder (str): The folder name to save the plots.
    - wine_list (list): The list of items for which plots will be created.
    """
    # Create a folder for graphs if it doesn't exist
    if saving_folder is not None and not os.path.isdir(f'figures/{saving_folder}'):
        os.makedirs(f'figures/{saving_folder}')
    
    # Generate plots for each item in the list
    for item in wine_list:
        plot_function(item, saving_folder)


def use_function(function, wine_list=wine_lists):
    """
    Apply a specified function to each df in the list.

    Parameters:
    - function (callable): The function to be applied.
    - wine_list (list): The list of df for which the function will be applied.
    """
    for item in wine_list:
        function(item)
In [120]:
create_plots(create_qq_plot,'1_QQ_plots', wine_lists)
No description has been provided for this image
No description has been provided for this image

Observations from QQ Plots of Red and White Wine Datasets:¶

  1. Right Skewness:

    • Red Wine: Residual Sugar and Alcohol show deviations in the lower quantiles.
    • White Wine: Similar to Red Wine, Residual Sugar and Alcohol show deviations in the lower quantiles.
  2. Left Skewness:

    • Red Wine: Free Sulfur Dioxide, Chlorides, and Sulphates show deviations in the upper quantiles.
    • White Wine: Volatile Acidity, Chlorides, and Sulphates show deviations in the upper quantiles.
  3. Implications for Data Processing:

    • The observed skewness in both datasets suggests the need for normalization transformations. We will continue to calculate skewness coefficient.
    • Techniques like logarithmic or Box-Cox transformations may be beneficial to address these deviations and improve the homogeneity of the data.
In [121]:
def calculate_skewness_coefficient(wine_type_df):

    print(f"\nThe skewness coefficient of {get_wine_str(wine_type_df)}: \n")

    numerical_columns = wine_type_df.select_dtypes(include=['number']).columns

    for column in numerical_columns:  
        skewness = round(wine_type_df[column].skew(),2)
        print(f"{column}: {skewness}")
In [122]:
# use_function(calculate_skewness_coefficient, wine_lists)

Observations from the skewness coefficient¶

  1. White Wine:
    • Chlorides (5.02): Highly skewness.
    • Volatile Acidity (1.58), Citric Acid (1.28), Residual Sugar (1.08), Free Sulfur Dioxide (1.41): : Moderate skewness.
  2. Red Wine:
    • Residual Sugar (4.54) and Chlorides (5.68): Highly skewness
    • Free Sulfur Dioxide (1.25), Total Sulfur Dioxide (1.52), Sulphates (2.43): Moderate skewness.

We are going to do log transformation for features of highly skewness and moderate skewness.

In [123]:
white_wine_log_columns = [
    'chlorides',
    'volatile acidity',
    'citric acid',
    'residual sugar',
    'free sulfur dioxide'
]

red_wine_log_columns = [
    'residual sugar',
    'chlorides',
    'free sulfur dioxide',
    'total sulfur dioxide',
    'sulphates'
]

log_red_wine_df = df_red_wine.copy()
log_white_wine_df = df_white_wine.copy()

log_red_wine_df[red_wine_log_columns] = np.log(log_red_wine_df[red_wine_log_columns] + 0.001)
log_white_wine_df[white_wine_log_columns] = np.log(log_white_wine_df[white_wine_log_columns]+ 0.001)

log_red_wine_df.wine_type = "Red Wine(Log)"
log_white_wine_df.wine_type = "White Wine(Log)"
In [124]:
log_dfs = [log_red_wine_df, log_white_wine_df]

create_plots(create_qq_plot, "2_QQ_plots_log", log_dfs)
#use_function(calculate_skewness_coefficient, log_dfs)
No description has been provided for this image
No description has been provided for this image

Observations based on the first log transformation:¶

  1. Red Wine (Original vs Log-Transformed):

    • Original: Residual sugar 4.54, Chlorides 5.68.
    • Log-Transformed: Residual sugar 1.81, Chlorides 1.79.
    • Free sulfur dioxide changed from positive (1.25) to slightly negative skewness (-0.23).
  2. White Wine (Original vs Log-Transformed):

    • Original: Volatile acidity 1.58, Citric acid 1.28.
    • Log-Transformed: Volatile acidity 0.14, Citric acid -5.56 (over-correction).
    • Residual sugar reduced from 1.08 to -0.16, Chlorides from 5.02 to 1.19.
  3. Minimal Impact on Some Variables:

    • Alcohol and quality in both Red and White wines showed minimal changes (around 0.86 and 0.22 respectively).
  4. Avoid Log Transformation For:

    • Red Wine: Free sulfur dioxide and Total sulfur dioxide.
    • White Wine: Citric acid and Residual sugar .
In [125]:
white_wine_update_log_columns = [
    'chlorides',
    'volatile acidity'
]
red_wine_update_log_columns = [
    'residual sugar',
    'chlorides'
]

log_red_update_wine_df = df_red_wine.copy()
log_white_update_wine_df = df_white_wine.copy()

log_red_update_wine_df[red_wine_update_log_columns] = np.log(log_red_update_wine_df[red_wine_update_log_columns] + 0.001)
log_white_update_wine_df[white_wine_update_log_columns] = np.log(log_white_update_wine_df[white_wine_update_log_columns]+ 0.001)

log_red_update_wine_df.wine_type = "Red Wine(Second Log)"
log_white_update_wine_df.wine_type = "White Wine(Second Log)"

log_update_dfs = [log_white_update_wine_df, log_red_update_wine_df]
In [126]:
create_plots(create_qq_plot, '3_QQ_plots_log_corrected', log_update_dfs)
#use_function(calculate_skewness_coefficient, log_update_dfs)
No description has been provided for this image
No description has been provided for this image

Observations from the model with log:¶

  1. White Wine:
    • Log transformation significantly reduced skewness in volatile acidity (from 1.58 to 0.14) and chlorides (from 5.02 to 1.19).
  2. Red Wine:
    • Effective reduction in skewness for residual sugar (from 4.54 to 1.81) and chlorides (from 5.68 to 1.79).
  3. Conclusion:
    • The second log transformation was successful in reducing high skewness for key variables in both Red and White Wine datasets.

3.3 Clean Data¶

3.3.1 Check for missing values¶

We are going to check if there are empty values.

In [127]:
def check_na(wine_type_df):
    print(f'{get_wine_str(wine_type_df)}')
    print(wine_type_df.info())
    print(wine_type_df.isnull().sum())

use_function(check_na, log_update_dfs)
White Wine(Second Log)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB
None
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
Red Wine(Second Log)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
None
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

Observations from Checking Missing Values:¶

  1. There are no missing data in either dataset.
  2. All feature columns are of the float type, and the target column is an integer.

3.3.2 Outlier Analysis¶

We will create box plots to understand the outliers.

In [128]:
def create_box_plot(wine_type_df, folder):
    feature_columns = wine_type_df.drop(columns=['quality']).columns.tolist()
    fig, axs = plt.subplots(3, 4, figsize=(25, 18))  

    # Flatten the array of axes to make it easier to iterate over
    axs = axs.flatten()

    for i, column in enumerate(feature_columns): 
        sns.boxplot(x='quality', y=column, data=wine_type_df, ax=axs[i], legend=False, color = "grey")
        axs[i].set_title(column, fontsize=30, weight='bold')  
        axs[i].set_xlabel('')
        axs[i].set_ylabel('')

        plt.xticks(rotation=45, fontsize=20)
        plt.yticks(fontsize=20)

    title = f'Box Plots for {get_wine_str(wine_type_df)}'
    fig.suptitle(title, fontsize=40, weight='bold')
    
    plt.savefig(f'figures/{folder}/{title}.png', bbox_inches = 'tight')
    plt.show()
In [129]:
#create_plots(create_box_plot, '1 _Boxplots_wine_type', log_update_dfs)

Observations from Box Plots¶

  1. It's hard to decide should we delete outliers or not based on these plots.
  2. We are going to develop separate QQ plots for wines classified into different quality categories: bad(3-4), middle(5-6-7), and good(8-9) to better understand our outliers.

To understand distribution asymmetries, we use skewness coefficient to identify which variables deviate from normality.

To pinpoint extreme outliers, we use a a threshold of Z-score > 6 to focuse on the most anomalous data points in the wine quality dataset.

In [130]:
def calculate_z_score(wine_type_df):
    z_scores_df = wine_type_df.copy()

    print(f"\nZ Score of {get_wine_str(wine_type_df)}:\n")

    for col in wine_type_df.columns:
        if col != 'quality' and col != 'quality_category':
            z_scores_df[col + ' Z-score'] = zscore(wine_type_df[col])
            
            outliers = abs(z_scores_df[col + ' Z-score']) > 6
            outlier_count = outliers.sum()
            print(f"Outliers in {col}: {outlier_count}")

            if outlier_count > 0:
                print(f"Outlier data for {col}:\n{wine_type_df[outliers]}")
                print("\n") 
In [131]:
#use_function(calculate_z_score, log_update_dfs)

Observations from Skewness Coefficients and Z-Score Analysis¶

White Wine:¶

  • Poor Quality White Wine:
    • Notable skewness in chlorides (1.51) and free sulfur dioxide (1.85).
    • Outliers are mainly in free sulfur dioxide and total sulfur dioxide.
  • Good Quality White Wine:
    • Higher skewness in citric acid (1.94) and density (1.53).
    • Outliers are present in several attributes, notably in citric acid (7 outliers) and density (1 outlier).

Red Wine:¶

  • Poor Quality Red Wine:
    • Considerable skewness in chlorides (2.51) and sulphates (3.08).
    • Outliers are observed in sulphates (2) and alcohol (1).
  • Good Quality Red Wine:
    • Skewness is pronounced in total sulfur dioxide (2.64) and sulphates (2.18).
    • Outliers are mainly in total sulfur dioxide (2) and sulphates (2).
In [132]:
def delete_outliers(wine_type_df):
    without_outliers_df = wine_type_df.copy()

    for col in without_outliers_df.columns:
        if col != 'quality' and col != 'quality_category':
            without_outliers_df[col + '_Z_score'] = zscore(without_outliers_df[col])

    for col in without_outliers_df.columns:
        if '_Z_score' in col:
            without_outliers_df = without_outliers_df[np.abs(without_outliers_df[col]) <= 6]

    return without_outliers_df.drop(columns=[c for c in without_outliers_df if '_Z_score' in c])
In [133]:
df_white_without_outliers = log_white_update_wine_df.copy()
df_red_without_outliers = log_red_update_wine_df.copy()

df_white_without_outliers = delete_outliers(log_white_update_wine_df)
df_red_without_outliers = delete_outliers(log_red_update_wine_df)

df_white_without_outliers.wine_type = 'White Wine (Without Outliers)'
df_red_without_outliers.wine_type = 'Red Wine (Without Outliers)'

wine_without_outliers_dfs = [df_white_without_outliers, df_red_without_outliers]
In [134]:
bin_edges = [2, 5, 10]
bin_labels = ['poor', 'good']

red_wine_quality_df = df_red_without_outliers.copy()
white_wine_quality_df = df_white_without_outliers.copy()

red_wine_quality_df['quality_category'] = pd.cut(red_wine_quality_df['quality'], bins=bin_edges, labels=bin_labels)
white_wine_quality_df['quality_category'] = pd.cut(white_wine_quality_df['quality'], bins=bin_edges, labels=bin_labels)

df_red_poor_without_outliers = red_wine_quality_df[red_wine_quality_df['quality_category'] == 'poor']
df_red_good_without_outliers = red_wine_quality_df[red_wine_quality_df['quality_category'] == 'good']

df_white_poor_without_outliers = white_wine_quality_df[white_wine_quality_df['quality_category'] == 'poor']
df_white_good_without_outliers = white_wine_quality_df[white_wine_quality_df['quality_category'] == 'good']


df_white_poor_without_outliers.wine_type = 'White Wine Poor (Without Outliers)'
df_white_good_without_outliers.wine_type = 'White Wine Good (Without Outliers)'
df_red_poor_without_outliers.wine_type = 'Red Wine Poor (Without Outliers)'
df_red_good_without_outliers.wine_type = 'Red Wine Good (Without Outliers)'

wine_quality_without_outliers_dfs = [
    df_white_poor_without_outliers,
    df_white_good_without_outliers,
    df_red_poor_without_outliers,
    df_red_good_without_outliers
]
In [135]:
create_plots(create_box_plot, '2_Boxplots_without_outliers', wine_without_outliers_dfs)
No description has been provided for this image
No description has been provided for this image
In [137]:
create_plots(create_box_plot, '3_Boxplots_quality_without_outliers', wine_quality_without_outliers_dfs)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

4.1 Visualize Final Data¶

  1. Histograms of quality distribiution for each of the wine type
In [138]:
# getting quality counts for red and white wines and sort them by quality
red_quality_counts = df_red_wine['quality'].value_counts().sort_index()
white_quality_counts = df_white_wine['quality'].value_counts().sort_index()

# normalize the counts to get densities
red_density = red_quality_counts / red_quality_counts.sum()
white_density = white_quality_counts / white_quality_counts.sum()

# convert to DataFrame
red_df = red_density.reset_index().rename(columns={'index': 'Quality', 'quality': 'quality'})
white_df = white_density.reset_index().rename(columns={'index': 'Quality', 'quality': 'quality'})
merged_df = pd.merge(red_df, white_df, on='quality', how='outer').fillna(0)
bar_plot_df = merged_df.rename(columns={'count_x': 'Red Wine Density', 'count_y': 'White Wine Density'})

# match columns for both features
melted_df = pd.melt(bar_plot_df, id_vars='quality', var_name='Wine Type', value_name='Density')

# plotting
plt.figure(figsize=(12, 8))
sns.barplot(x='quality', y='Density', hue='Wine Type', data=melted_df, palette='viridis')

# description
plt.title('Density Distribution of Red and White Wines by Quality', fontsize=30, weight ='bold')
plt.xlabel('Quality',  fontsize=25)
plt.ylabel('Density', fontsize=25)
plt.xticks(rotation=45, fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=20)

# save fig
plt.savefig('figures/Density Distribution of Red and White Wines by Quality.png', bbox_inches = 'tight')
plt.show()
No description has been provided for this image
  1. Raw Clustermap of our cleaned data
In [ ]:
# create labels of color to our data
df_red_wine['color'] = 1
df_white_wine['color'] = 2

df_wine = pd.concat([df_red_wine, df_white_wine[1:]], axis=0).reset_index(drop=True)

def create_cluster_by_rows(dataset):
    '''
    Displays a clustermap plot with labels.

    Parameters:
    - dataset (pandas.DataFrame): The DataFrame containing data related to red and white wines.
    '''
    # select numerical columns from the dataset
    data = dataset.select_dtypes(include=['number'])

    # extract and remove the 'color' column from the data
    agerange = data.pop("color")

    # define color mapping for wine types labels
    colors_of_labels = {1: "#b11226", 2: "#EEEDC4"}

    # Map wine types to colors
    labels = agerange.map(colors_of_labels)
    df_with_labels = pd.DataFrame(data={'wine type': labels})

    # create the clustermap
    g = sns.clustermap(df_wine.iloc[:,:-2], cmap='plasma', row_colors=df_with_labels, annot_kws={'size': 30}, 
                standard_scale=1, figsize=(15, 20))

    # Hide row and column dendrograms 
    g.ax_row_dendrogram.set_visible(False)
    g.ax_col_dendrogram.set_visible(False)    

    # describe
    title = 'Composite Clustermap White and Red Wine'
    plt.suptitle('Composite Clustermap White and Red Wine', fontsize=40, weight ='bold',y= 0.85, x = 0.60)
    plt.setp(g.ax_heatmap.get_xticklabels(), fontsize=30, rotation = 45, ha='right')
    g.ax_cbar.set_position((1, .2, .03, .4))

    plt.savefig(f'figures/{title}.png', bbox_inches = 'tight')
    plt.show()

create_cluster_by_rows(df_wine)
No description has been provided for this image

5. Model creation¶

Firstly, for the model we have the choose which features we want to use.

-> Create clustermap and heatmap to understant the relationship between features and wine quality score.

Comments about clustermap:¶

  • Our model's performance is influenced by the considerable multicollinearity observed in the data. Identifying features with minimal correlation is essential.
  • To achieve this, we compare different feature lists, selecting one from closely correlated pairs for model evaluation. Each model is restricted to a maximum of 8 features. This approach aims to uncover independent and meaningful predictors for the development of an optimal model.
In [151]:
def create_corr_matrix(wine_type_df):
    """
    Create and return the correlation matrix for the numeric columns in the given df.

    Parameters:
    - wine_type_df (pandas.DataFrame): DataFrame that we want to do matrix on

    Returns:
    The correlation matrix for the numeric columns
    """
    numeric_columns = check_numeric_columns(wine_type_df)
    return wine_type_df[numeric_columns].corr()


def create_clustermap(wine_type_df, folder):
    """
    Create and display a clustermap of the correlation matrix for the df.

    Parameters:
    - wine_type_df (pandas.DataFrame): The DataFrame that we want to display clustermap
    """
    # create the correlation matrix
    correlation_matrix = create_corr_matrix(wine_type_df)
    
    # calculate p-values for each correlation coefficient
    pvalues = correlation_matrix.apply(lambda x: 
                                       correlation_matrix.columns.map(lambda y: 
                                                                      pearsonr(wine_type_df[x.name], wine_type_df[y])[1]))
    
    # create the mask for non-significant correlations (p-value > 0.05)
    mask_significant = pvalues > 0.05
    
    # create and display the clustermap
    g = sns.clustermap(correlation_matrix, 
                       mask=mask_significant, annot=True, 
                       cmap='seismic', 
                       row_cluster=True, 
                       col_cluster=True,
                       linewidths=.5, vmax=1, vmin=-1, annot_kws={'size': 10},
                       figsize=(12, 12))
    
    # hide column dendrogram
    g.ax_col_dendrogram.set_visible(False)
    
    # descrption
    title=f'Correlation Matrix Clustermap for {get_wine_str(wine_type_df)}'
    g.fig.suptitle(title, fontsize=20, weight='bold' , y=0.85)
    plt.setp(g.ax_heatmap.get_xticklabels(), fontsize=12, rotation=45, ha='right')
    plt.setp(g.ax_heatmap.get_yticklabels(), fontsize=12)
    g.ax_cbar.set_position((1, .2,.03, .4))
    
    # save
    
    plt.savefig(f'figures/{folder}/{title}.png', bbox_inches = 'tight')
    plt.show()
In [152]:
create_plots(create_clustermap, '1_Clustermaps_wine_type', wine_without_outliers_dfs)
No description has been provided for this image
No description has been provided for this image
In [153]:
create_plots(create_clustermap, '2_Clustermaps_quality_without_outliers', wine_quality_without_outliers_dfs)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

5.1 Creating features lists for models to find the best combination¶

In [154]:
def create_corr_with_target(wine_type_df, target = 'quality'):
    correlation_matrix = create_corr_matrix(wine_type_df)
    correlation_with_target = correlation_matrix[target].sort_values(key=abs, ascending=False)

    return correlation_with_target


def return_top_features_with_target(wine_type_df, n = 8):
    '''
    A function that return top features that are related to the target column.
    '''
    correlation_with_target = create_corr_with_target(wine_type_df)

    top_features = correlation_with_target[1:n+1].index.tolist() 
    return top_features
In [155]:
def return_features_among_highly_associated(wine_type_df, threshold = 0.6, n =8 ):
    '''
    Any feature that has a correlation greater than threshold with any of the already selected features is not chosen
    '''
    correlation_matrix = create_corr_matrix(wine_type_df)
    correlation_with_target = create_corr_with_target(wine_type_df)
    #print(get_wine_str(wine_type_df))
    # print(correlation_matrix)
    #print(correlation_with_target)

    selected_features = [correlation_with_target.index[1]]  


    for feature in correlation_with_target.index[2:]:
        if all(abs(correlation_matrix[feature][selected_feature]) < threshold for selected_feature in selected_features):
            selected_features.append(feature)
            if len(selected_features) == n: 
                break

    return selected_features
In [156]:
top_features_dict = {}
features_among_highly_associated_dict = {}

def create_features_dict(df_lists):
    for wine_df in df_lists:
        try:
            df_name = get_wine_str(wine_df)
            top_features_dict[df_name] = return_top_features_with_target(wine_df)
            features_among_highly_associated_dict[df_name] = return_features_among_highly_associated(wine_df)
        except Exception as e:
            print(f"Error processing {df_name}: {e}")
In [157]:
features_among_highly_associated_dict
Out[157]:
{}
In [158]:
create_features_dict(wine_quality_without_outliers_dfs)
In [159]:
create_features_dict(wine_without_outliers_dfs)

5 Creating Model¶

5.1 Linear regression using sm.OLS¶

In [449]:
def evaluate_ols_model(X, y, test_size=0.2, random_state=42):
    """
    Cfreate OLS regression model and evalate it.

    Parameters:
    - X (pandas.DataFrame): Independent variables.
    - y (pandas.Series): Dependent variable.
    - test_size (float): Size of the test set in the train-test split (default: 0.2).
    - random_state (int): Random seed for reproducibility (default: 42).

    Returns:
    OLS model results summary, R-squared score, and Mean Squared Error (MSE) on the test set.
    """
    #train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # add a constant term to the independent variables for the OLS model
    X_train = sm.add_constant(X_train)
    X_test = sm.add_constant(X_test)

    # fit & predict
    model = sm.OLS(y_train, X_train)
    model_results = model.fit()
    y_pred = model_results.predict(X_test)

    # evaluate
    test_r2 = r2_score(y_test, y_pred)
    test_mse = mean_squared_error(y_test, y_pred)

    return model_results, test_r2, test_mse 
In [491]:
results_df = pd.DataFrame()

def find_best_model(df, features=[top_features_dict, features_among_highly_associated_dict]):
    """
    Find the best  model among different feature sets for a given dataset.

    Parameters:
    - df (pandas.DataFrame): Dataset.
    - features (list): List of feature sets to evaluate.

    Returns:
    - pandas.Series: Series containing information about the best OLS model: model name, model, features list, r2 for test set, MSE for test set.
    """
    dataset_name = get_wine_str(df)
    best_r2 = 0
    
    for feature_set in features:
        # name the model
        model_name = f"Model_{dataset_name}"
        # create X and y sets
        X = df[feature_set[dataset_name]]
        y = df['quality']
        
        # evaluate the OLS model
        model, r2, mse = evaluate_ols_model(X, y)      

        # Compare R-squared scores
        if best_r2 < r2:
            best_model, best_features, best_r2, best_mse,  = model, list(X.columns), r2, mse

        result_model = pd.Series({
            'Model name': model_name,
            'model': best_model,
            'Features': best_features,
            'Test r2': best_r2,
            'Test MSE': best_mse
        })

    return result_model

Now we are testing all different combinations of features on datasets.

In [492]:
for dataset in wine_quality_without_outliers_dfs:
    dataset_name = get_wine_str(dataset)
    model = find_best_model(dataset)
    results_df = pd.concat([results_df, model], axis=1)

5.2 The list of the best models with their scores:¶

In [489]:
results_df = results_df.transpose()
results_df

5.2.1 Comparision of poor wine:¶

In [493]:
def describe_the_model(number_of_the_model):
    '''
    Print description of the model
    
    Parameters (int): number of the model in the result_df
    '''
    model = results_df.iloc[number_of_the_model]
    print('-------------------------------------------------------')
    print(f'The summary for the {model["Model name"]}:\n\n{model["model"].summary()}.\n')
    print('-------------------------------------------------------')
    print(f'R-squared on Test Set: {model["Test r2"]}')
    print(f'Mean Squared Error on Test Set: {model["Test MSE"]}\n')
In [494]:
describe_the_model(0)
-------------------------------------------------------
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File c:\Users\zosia\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\base.py:3790, in Index.get_loc(self, key)
   3789 try:
-> 3790     return self._engine.get_loc(casted_key)
   3791 except KeyError as err:

File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:160, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\index_class_helper.pxi:70, in pandas._libs.index.Int64Engine._check_type()

KeyError: 'Model name'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
c:\Users\zosia\OneDrive\Desktop\Wine_quality\Wine_quality\main.ipynb Cell 71 line 1
----> <a href='vscode-notebook-cell:/c%3A/Users/zosia/OneDrive/Desktop/Wine_quality/Wine_quality/main.ipynb#Y236sZmlsZQ%3D%3D?line=0'>1</a> describe_the_model(0)

c:\Users\zosia\OneDrive\Desktop\Wine_quality\Wine_quality\main.ipynb Cell 71 line 9
      <a href='vscode-notebook-cell:/c%3A/Users/zosia/OneDrive/Desktop/Wine_quality/Wine_quality/main.ipynb#Y236sZmlsZQ%3D%3D?line=6'>7</a> model = results_df.iloc[number_of_the_model]
      <a href='vscode-notebook-cell:/c%3A/Users/zosia/OneDrive/Desktop/Wine_quality/Wine_quality/main.ipynb#Y236sZmlsZQ%3D%3D?line=7'>8</a> print('-------------------------------------------------------')
----> <a href='vscode-notebook-cell:/c%3A/Users/zosia/OneDrive/Desktop/Wine_quality/Wine_quality/main.ipynb#Y236sZmlsZQ%3D%3D?line=8'>9</a> print(f'The summary for the {model["Model name"]}:\n\n{model["model"].summary()}.\n')
     <a href='vscode-notebook-cell:/c%3A/Users/zosia/OneDrive/Desktop/Wine_quality/Wine_quality/main.ipynb#Y236sZmlsZQ%3D%3D?line=9'>10</a> print('-------------------------------------------------------')
     <a href='vscode-notebook-cell:/c%3A/Users/zosia/OneDrive/Desktop/Wine_quality/Wine_quality/main.ipynb#Y236sZmlsZQ%3D%3D?line=10'>11</a> print(f'R-squared on Test Set: {model["Test r2"]}')

File c:\Users\zosia\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\series.py:1040, in Series.__getitem__(self, key)
   1037     return self._values[key]
   1039 elif key_is_scalar:
-> 1040     return self._get_value(key)
   1042 # Convert generator to list before going through hashable part
   1043 # (We will iterate through the generator there to check for slices)
   1044 if is_iterator(key):

File c:\Users\zosia\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\series.py:1156, in Series._get_value(self, label, takeable)
   1153     return self._values[label]
   1155 # Similar to Index.get_value, but we do not fall back to positional
-> 1156 loc = self.index.get_loc(label)
   1158 if is_integer(loc):
   1159     return self._values[loc]

File c:\Users\zosia\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\base.py:3797, in Index.get_loc(self, key)
   3792     if isinstance(casted_key, slice) or (
   3793         isinstance(casted_key, abc.Iterable)
   3794         and any(isinstance(x, slice) for x in casted_key)
   3795     ):
   3796         raise InvalidIndexError(key)
-> 3797     raise KeyError(key) from err
   3798 except TypeError:
   3799     # If we have a listlike key, _check_indexing_error will raise
   3800     #  InvalidIndexError. Otherwise we fall through and re-raise
   3801     #  the TypeError.
   3802     self._check_indexing_error(key)

KeyError: 'Model name'
In [486]:
describe_the_model(2)
-------------------------------------------------------
The summary for the Model_Red Wine Poor (Without Outliers):

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                quality   R-squared:                       0.099
Model:                            OLS   Adj. R-squared:                  0.087
Method:                 Least Squares   F-statistic:                     8.017
Date:                Tue, 21 Nov 2023   Prob (F-statistic):           2.76e-10
Time:                        17:55:09   Log-Likelihood:                -180.16
No. Observations:                 592   AIC:                             378.3
Df Residuals:                     583   BIC:                             417.8
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -5.4040     12.471     -0.433      0.665     -29.897      19.089
volatile acidity        -0.5369      0.090     -5.985      0.000      -0.713      -0.361
total sulfur dioxide     0.0017      0.000      4.119      0.000       0.001       0.003
pH                      -0.2327      0.111     -2.093      0.037      -0.451      -0.014
citric acid             -0.2948      0.116     -2.538      0.011      -0.523      -0.067
alcohol                  0.0139      0.022      0.628      0.530      -0.030       0.058
sulphates                0.0283      0.095      0.299      0.765      -0.158       0.215
density                 11.2753     12.464      0.905      0.366     -13.204      35.754
residual sugar          -0.0408      0.048     -0.855      0.393      -0.134       0.053
==============================================================================
Omnibus:                      407.201   Durbin-Watson:                   2.014
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3857.604
Skew:                          -3.088   Prob(JB):                         0.00
Kurtosis:                      13.874   Cond. No.                     8.51e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems..

-------------------------------------------------------
R-squared on Test Set: 0.13356530134716293
Mean Squared Error on Test Set: 0.08250272298329342

Discussion:¶

White Poor Wine (N=1639)¶

  • R-squared on the test = 0.070:

    • The R-squared value is 0.070, indicating that only 7% of the variability in the dependent variable (quality) is explained by the model. Also, the adjusted R-squared is 0.060, adjusting for the number of predictor variables, suggesting that we should reduce the number of parameters because they are not improving the model.
  • F-Statistic = 11.69:

    • The model is significant.
  • Features Significantly Constituting the Model:

    • free sulfur dioxide: A one-unit increase in free sulfur dioxide is associated with a 0.0020 increase in quality.
    • volatile acidity: A one-unit increase in volatile acidity is associated with a -0.1660 decrease in quality.
    • residual sugar: A one-unit increase in residual sugar is associated with a 0.0043 increase in quality.

Red Poor Wine (N=741)¶

  • R-squared on the test = 0.133:

    • The R-squared value is 0.099, indicating that approximately 9.9% of the variability in the dependent variable (quality) is explained by the model. However, the adjusted R-squared is 0.087, adjusting for the number of predictor variables. This is slightly lower than the R-squared, suggesting that not all added variables contribute meaningfully to the model.
  • F-Statistic = 8.017:

    • The F-statistic tests the overall significance of the model.
  • Features Significantly Constituting the Model:

    • volatile acidity: A one-unit increase in volatile acidity is associated with an approximate -0.537 decrease in wine quality.
    • total sulfur dioxide: A one-unit increase in total sulfur dioxide is associated with an increase of 0.0017 in wine quality.
    • pH: A one-unit increase in pH is associated with an approximate -0.233 decrease in wine quality.

Summary:¶

Based on our models we can observe that volatile acidity and level of free/total sulfur dioxide are importnat feature in predicitng poor wine. Since the models don't have the best performance and also the adjusting r2 suggests that we may reduce the nmber of feautres in the model without loosing a lot of information. Also it will be interesing to observe how combination of multicolinear features like total sulfur dioxide and free sulfur dioxide will influance the model.

5.2.2 Comparision of good wine:¶

In [433]:
describe_the_model(1)
-------------------------------------------------------
The summary for the Model_White Wine Good (Without Outliers):

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                quality   R-squared:                       0.116
Model:                            OLS   Adj. R-squared:                  0.113
Method:                 Least Squares   F-statistic:                     42.62
Date:                Tue, 21 Nov 2023   Prob (F-statistic):           1.61e-64
Time:                        17:43:28   Log-Likelihood:                -2161.5
No. Observations:                2603   AIC:                             4341.
Df Residuals:                    2594   BIC:                             4394.
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                  115.9796     22.978      5.048      0.000      70.923     161.036
alcohol                  0.0188      0.028      0.662      0.508      -0.037       0.074
density               -113.8570     23.251     -4.897      0.000    -159.449     -68.265
chlorides               -0.0868      0.044     -1.995      0.046      -0.172      -0.002
total sulfur dioxide     0.0004      0.000      1.076      0.282      -0.000       0.001
residual sugar           0.0499      0.009      5.636      0.000       0.033       0.067
pH                       0.6328      0.113      5.614      0.000       0.412       0.854
fixed acidity            0.0952      0.024      3.979      0.000       0.048       0.142
citric acid              0.0427      0.110      0.388      0.698      -0.174       0.259
==============================================================================
Omnibus:                      451.296   Durbin-Watson:                   1.930
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              724.773
Skew:                           1.182   Prob(JB):                    4.15e-158
Kurtosis:                       4.047   Cond. No.                     4.18e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.18e+05. This might indicate that there are
strong multicollinearity or other numerical problems..

-------------------------------------------------------
R-squared on Test Set: 0.1193013699182327
Mean Squared Error on Test Set: 0.3260904708463896

In [487]:
describe_the_model(3)
-------------------------------------------------------
The summary for the Model_Red Wine Good (Without Outliers):

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                quality   R-squared:                       0.196
Model:                            OLS   Adj. R-squared:                  0.187
Method:                 Least Squares   F-statistic:                     20.48
Date:                Tue, 21 Nov 2023   Prob (F-statistic):           7.36e-28
Time:                        17:55:23   Log-Likelihood:                -419.30
No. Observations:                 680   AIC:                             856.6
Df Residuals:                     671   BIC:                             897.3
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   10.0697     13.171      0.765      0.445     -15.791      35.930
alcohol                  0.1289      0.022      5.962      0.000       0.086       0.171
volatile acidity        -0.1880      0.146     -1.292      0.197      -0.474       0.098
sulphates                0.6421      0.137      4.682      0.000       0.373       0.911
citric acid              0.1350      0.149      0.904      0.366      -0.158       0.428
chlorides               -0.1498      0.063     -2.375      0.018      -0.274      -0.026
total sulfur dioxide    -0.0018      0.001     -2.400      0.017      -0.003      -0.000
density                 -5.0362     13.081     -0.385      0.700     -30.722      20.649
pH                      -0.2665      0.150     -1.772      0.077      -0.562       0.029
==============================================================================
Omnibus:                      113.755   Durbin-Watson:                   1.973
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              174.803
Skew:                           1.108   Prob(JB):                     1.10e-38
Kurtosis:                       4.121   Cond. No.                     4.95e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.95e+04. This might indicate that there are
strong multicollinearity or other numerical problems..

-------------------------------------------------------
R-squared on Test Set: 0.28466886558862003
Mean Squared Error on Test Set: 0.147073314184919

Discussion:¶

White Good Wine (N=3249)¶

  • R-squared on the test = 0.119:

    • The R-squared value on the test is 0.119, indicating that approximately 11.9% of the variability in the dependent variable (quality) is explained by the model.
  • F-Statistic = 42.62:

    • The F-statistic tests the overall significance of the model. A larger F-statistic with a small p-value suggests that at least one predictor variable is significantly related to the dependent variable.
    • The model is statistically significant.
  • Features Significantly Constituting the Model:

    • density: A one-unit increase in density is associated with a quality decrease of approximately 113.8570 units.
    • cholorides: A one-unit increase in chloride content is associated with a quality decrease of approximately 0.0868 units.
    • residual sugar: A one-unit increase in residual sugar is associated with a quality increase of approximately 0.0499 units.
    • pH: A one-unit increase in pH is associated with a quality increase of approximately 0.6328 units.
    • fixed acidity: A one-unit increase in fixed acidity is associated with a quality increase of approximately 0.0952 units.

Red Good Wine (N=851)¶

  • R-squared on the test = 0.285:

    • The R-squared value on the test is 0.285, indicating that approximately 28.5% of the variability in the dependent variable (quality) is explained by the model. It is much higher value then r2 of train data, which can by cause by random fluctuation.
  • F-Statistic = 20.48:

    • The model is statistically significant.
  • Features Significantly Constituting the Model:

    • alcohol: A one-unit increase in alcohol is associated with a quality decrease of approximately 0.1289 units.
    • sulphates: A one-unit increase in sulphates content is associated with a quality decrease of approximately 0.6421 units.
    • chlorides: A one-unit increase in chloride content is associated with a quality decrease of approximately 0.1498 units.
    • total sulfur dioxide: A one-unit increase in totla sulfur dioxide is associated with a quality increase of approximately 0.0018 units.

Summary:¶

Based on our models we can say that the chlorides level can be an important predictor of good quality of wine. What is interesing here is the high score and error of density can suggest that there are two clusters of points with different density level. In the future it will be interesing to split the data to smoler groups and observe subclusters. Maybe they can by characterised by different quality level... To check it in the future it can be interesting to check how models with different density level clusters will perform.

Recipe¶

In [ ]:
def identyfy_three_important_features(evaluation_matrix_list): 
    result_dict = []
    # Iterate through unique dataframes
    for dataframe in evaluation_matrix_list:
        df_subset = dataframe[0].iloc[1:].copy()
        df_subset['Coefficients'] = df_subset['Coefficients'].abs()
        
        # Sort the subset based on coefficients in descending order
        sorted_subset = df_subset.sort_values(by='Coefficients', ascending=False).reset_index(drop=True)

        # Select the top three important features
        top_three_features = sorted_subset['Parameters'].loc[:2:].tolist()   
        # Create a dictionary entry for the current dataframe
        results_df = result_dict.append({dataframe[0]['df'][1] : top_three_features})
        
    return result_dict

Finding the recipe for the good and poor wine¶

In [ ]:
top_three_features_list = identyfy_three_important_features(evaluation_matrix_list=evaluation_matrix_list)
In [ ]:
for df in wine_quality_without_outliers_dfs:
    print(get_wine_str(df))
White Wine Poor (Without Outliers)
White Wine Good (Without Outliers)
Red Wine Poor (Without Outliers)
Red Wine Good (Without Outliers)
In [ ]:
features_to_reverse_white = ['chlorides', 'volatile acidity']
features_to_reverse_red = ['residual sugar', 'chlorides']

def correct_reverse_log_transformation(df, features):
    for feature in features:
        if feature in df.columns:
            df[feature] = np.exp(df[feature])
    return df

df_white_good_without_outliers = correct_reverse_log_transformation(df_white_good_without_outliers, features_to_reverse_white)
df_white_poor_without_outliers = correct_reverse_log_transformation(df_white_poor_without_outliers, features_to_reverse_white)

df_red_good_without_outliers = correct_reverse_log_transformation(df_red_good_without_outliers, features_to_reverse_red)
df_red_poor_without_outliers = correct_reverse_log_transformation(df_red_poor_without_outliers, features_to_reverse_red)



df_white_good_without_outliers
/var/folders/bg/px1wxjwx6dz5w5214_zfxwx40000gn/T/ipykernel_97189/2879086458.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[feature] = np.exp(df[feature])
Out[ ]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality quality_category
0 7.0 0.271 0.36 20.7 0.046 45.0 170.0 1.00100 3.00 0.45 8.8 6 good
1 6.3 0.301 0.34 1.6 0.050 14.0 132.0 0.99400 3.30 0.49 9.5 6 good
2 8.1 0.281 0.40 6.9 0.051 30.0 97.0 0.99510 3.26 0.44 10.1 6 good
3 7.2 0.231 0.32 8.5 0.059 47.0 186.0 0.99560 3.19 0.40 9.9 6 good
4 7.2 0.231 0.32 8.5 0.059 47.0 186.0 0.99560 3.19 0.40 9.9 6 good
... ... ... ... ... ... ... ... ... ... ... ... ... ...
4891 5.7 0.211 0.32 0.9 0.039 38.0 121.0 0.99074 3.24 0.46 10.6 6 good
4893 6.2 0.211 0.29 1.6 0.040 24.0 92.0 0.99114 3.27 0.50 11.2 6 good
4895 6.5 0.241 0.19 1.2 0.042 30.0 111.0 0.99254 2.99 0.46 9.4 6 good
4896 5.5 0.291 0.30 1.1 0.023 20.0 110.0 0.98869 3.34 0.38 12.8 7 good
4897 6.0 0.211 0.38 0.8 0.021 22.0 98.0 0.98941 3.26 0.32 11.8 6 good

3254 rows × 13 columns

In [ ]:
df_white_good_without_outliers
Out[ ]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality quality_category
0 7.0 0.271 0.36 20.7 0.046 45.0 170.0 1.00100 3.00 0.45 8.8 6 good
1 6.3 0.301 0.34 1.6 0.050 14.0 132.0 0.99400 3.30 0.49 9.5 6 good
2 8.1 0.281 0.40 6.9 0.051 30.0 97.0 0.99510 3.26 0.44 10.1 6 good
3 7.2 0.231 0.32 8.5 0.059 47.0 186.0 0.99560 3.19 0.40 9.9 6 good
4 7.2 0.231 0.32 8.5 0.059 47.0 186.0 0.99560 3.19 0.40 9.9 6 good
... ... ... ... ... ... ... ... ... ... ... ... ... ...
4891 5.7 0.211 0.32 0.9 0.039 38.0 121.0 0.99074 3.24 0.46 10.6 6 good
4893 6.2 0.211 0.29 1.6 0.040 24.0 92.0 0.99114 3.27 0.50 11.2 6 good
4895 6.5 0.241 0.19 1.2 0.042 30.0 111.0 0.99254 2.99 0.46 9.4 6 good
4896 5.5 0.291 0.30 1.1 0.023 20.0 110.0 0.98869 3.34 0.38 12.8 7 good
4897 6.0 0.211 0.38 0.8 0.021 22.0 98.0 0.98941 3.26 0.32 11.8 6 good

3254 rows × 13 columns

In [ ]:
def get_receipes(top_three_features_list, df_lists):
    for df in df_lists:
        pipeline_name = f"Pipeline_{get_wine_str(df)}"
        for model in top_three_features_list:
            if pipeline_name in model:
                feature_list = model[pipeline_name]
                numeric_columns = check_numeric_columns(df)
                numeric_df = df[numeric_columns]

                means = numeric_df[feature_list].mean()
                stds = numeric_df[feature_list].std()

                ranges = {feature: (means[feature] - stds[feature], means[feature] + stds[feature]) for feature in feature_list}

                print(f"\nRecipes of {get_wine_str(df)} wine:")
                for feature, range_vals in ranges.items():
                    print(f"{feature.capitalize()}: {range_vals[0]:.2f} - {range_vals[1]:.2f}")
In [ ]:
get_receipes(top_three_features_list, wine_quality_without_outliers_dfs) 
Recipes of White Wine Poor (Without Outliers) wine:
Volatile acidity: 0.20 - 0.42
Free sulfur dioxide: 16.18 - 53.87
Fixed acidity: 6.08 - 7.85

Recipes of White Wine Good (Without Outliers) wine:
Alcohol: 9.60 - 12.09
Residual sugar: 1.22 - 10.86
Volatile acidity: 0.17 - 0.35

Recipes of Red Wine Poor (Without Outliers) wine:
Volatile acidity: 0.41 - 0.77
Total sulfur dioxide: 17.86 - 91.21
Citric acid: 0.06 - 0.42

Recipes of Red Wine Good (Without Outliers) wine:
Alcohol: 9.75 - 11.96
Sulphates: 0.55 - 0.83
Chlorides: 0.05 - 0.12
In [ ]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Function to train Random Forest and get feature importances
def get_feature_importances(dataframe):
    # Check and select only numeric columns, excluding 'quality'
    numeric_columns = dataframe.select_dtypes(include=np.number).columns.tolist()
    if 'quality' in numeric_columns:
        numeric_columns.remove('quality')
    X = dataframe[numeric_columns]
    y = dataframe['quality']
    rf = RandomForestClassifier(random_state=42)
    rf.fit(X, y)
    return rf.feature_importances_, numeric_columns

# Load your dataframes here
# df_red_good_without_outliers, df_red_poor_without_outliers, df_white_good_without_outliers, df_white_poor_without_outliers

# Get feature importances and numeric columns
good_red_importances, good_red_columns = get_feature_importances(df_red_good_without_outliers)
poor_red_importances, poor_red_columns = get_feature_importances(df_red_poor_without_outliers)
good_white_importances, good_white_columns = get_feature_importances(df_white_good_without_outliers)
poor_white_importances, poor_white_columns = get_feature_importances(df_white_poor_without_outliers)

# Plotting
fig, axes = plt.subplots(1, 4, figsize=(20, 5), sharey=True)
axes[0].bar(good_red_columns, good_red_importances)
axes[1].bar(poor_red_columns, poor_red_importances)
axes[2].bar(good_white_columns, good_white_importances)
axes[3].bar(poor_white_columns, poor_white_importances)

# Add titles and sample sizes as annotations
for i, df, columns in zip(range(4), 
                          [df_red_good_without_outliers, df_red_poor_without_outliers, df_white_good_without_outliers, df_white_poor_without_outliers], 
                          [good_red_columns, poor_red_columns, good_white_columns, poor_white_columns]):
    axes[i].set_title(f"{'Good' if i%2 == 0 else 'Poor'} {'Red' if i < 2 else 'White'} Wine")
    axes[i].set_xticks(range(len(columns)))
    axes[i].set_xticklabels(columns, rotation=45, ha="right")
    axes[i].annotate(f"n = {len(df)}", xy=(0.5, 1), xycoords='axes fraction', ha='center')

plt.tight_layout()
plt.show()
No description has been provided for this image

Significant Statement¶

Conclusion & Discussions¶

Additional - Gridsearch Model creation¶

Firstly, we decided to create the Grid seach with cross-validation to find the best model for each of our datasets, but after realising that SciKit Learn doesn't provide statistical analysis of the model we had to recreate the Linear Model using scipy library.**

Function: evaluate_GS_model¶

Parameters:

  • pipe - the pipeline through which we pass
  • X_train, y_train, X_test, y_test - data
  • classifier_params - hyperparameters for the evaluated model
  • pipeline_name
  • cv - type of cross-validation (StratifiedKFold 5-fold)

We define a GridSearchCV refitting on r2.

  • Fit the model.
  • Predict test data.
  • Extract the mean r2 from cross-validation.
  • Select the model with the best parameters.
  • Save the results to a dataframe.
  • Calculate the score matrix (since SciKit Learn doesn't provide the statistical analysis of the model we tried to recreate it using stats library)
In [180]:
def evaluate_GS_model(
    pipe,
    X_train, 
    y_train, 
    X_test, 
    y_test, 
    regression_params,
    pipeline_name,
    cv=KFold(n_splits=5),
    predict_test=True,
    predict_train=True,
):
    """
    Evaluate a machine learning model using grid search with cross-validation.

    Parameters:
    - pipe (sklearn.pipeline.Pipeline): The machine learning pipeline to evaluate.
    - X_train, y_train, X_test, y_test (pandas.DataFrame): Training and testing datasets.
    - regression_params (dict): Parameter grid for grid search.
    - pipeline_name (str): Name of the machine learning pipeline.
    - cv (sklearn.model_selection._split): Cross-validation strategy (default: KFold with 5 splits).
    - predict_test (bool): Whether to predict and evaluate on the test set.
    - predict_train (bool): Whether to predict and evaluate on the training set.

    Returns:
    - tuple: Results DataFrame and Evaluation Matrix DataFrame.
    """
    
    # define grid search
    grid_search_model = GridSearchCV(
        pipe,
        regression_params,
        cv=cv,
        scoring=["r2", "neg_mean_squared_error", "neg_median_absolute_error"],
        refit="neg_mean_squared_error",
        return_train_score=True,
        verbose=3,
    )

    # fit model
    grid_search_model.fit(X_train, y_train)
    y_test_pred = grid_search_model.predict(X_test) if predict_test else None
    test_score = metrics.mean_squared_error(y_test, y_test_pred) if predict_test else None

    y_train_pred = grid_search_model.predict(X_train) if predict_train else None
    train_score = metrics.mean_squared_error(y_train, y_train_pred) if predict_train else None 

    mean_cv_score = grid_search_model.best_score_

    cv_results_df = pd.DataFrame(grid_search_model.cv_results_).iloc[[grid_search_model.best_index_]]
    cv_splits_scores_df = cv_results_df.filter(regex=r"split\d*_test_mean_squared_error").reset_index(drop=True) 
    
    # getting scores based on the metrics defined in the score
    metrics_results_df = cv_results_df.filter(regex=r"mean_test_*").reset_index(drop=True)

    best_estimator = grid_search_model.best_estimator_.named_steps[pipe.steps[-1][0]]

    coefficients = best_estimator.coef_
    intercept = best_estimator.intercept_

    # save results in DataFrame
    this_result = pd.concat(
        [pd.DataFrame({
            "pipeline_name": [pipeline_name],
            "features": [list(X_test.columns)],
            'coef': [coefficients], 
            "train score": [train_score],
            "mean_cv_score": [mean_cv_score],
            "test_score": [test_score],
            "best_model": [grid_search_model.best_estimator_],
            "parameters": [grid_search_model.best_params_],
            }),
            cv_splits_scores_df,
            metrics_results_df,
        ],
        axis=1
    )

    # create DataFrame for the evaluation matrix
    parameters = np.append(intercept, coefficients)
    dataset_column_names = ['Intercept']
    dataset_column_names.extend(list(X_test.columns))
    
    evaluation_matrix_df = pd.DataFrame({
        'df': [pipeline_name for number in range(len(parameters))],
        'const': [pipeline_name for number in range(len(parameters))],
        "Parameters": pd.Series(dataset_column_names),
        "Coefficients": pd.Series(parameters),
    })
    
    return this_result, evaluation_matrix_df

Define Linear Models using pipelines¶

  • For y we take the labels representing the quality of wine form 1 to 10 (1 the lowest score)
  • For X we select the 8 psychomical features.
  • The three pairs of the features that were highly correlated we chose the former, the later or the mean of both.
In [181]:
#params for our gridsearch
base_steps = [('scaler', StandardScaler())] #normalizing values
lr = ("lr", LinearRegression())
lr_params = dict(lr__fit_intercept=[True, False])

estimators = [(lr, lr_params)]
In [189]:
def building_pipelines(df_lists, features=[top_features_dict, features_among_highly_associated_dict], test_size=0.2, random_state=42):
    """
    Create pipelines and evaluate models for each dataset and feature set.

    Parameters:
    - df_lists (list): List of datasets.
    - features (list): List of feature sets to use in the pipelines.
    - test_size (float): Size of the test set in the train-test split (default: 0.2).
    - random_state (int): Random seed for reproducibility (default: 42).

    Returns:
    - tuple: Pipeline, DataFrame with scores, and a list of evaluation matrices.
    """
    results_df = pd.DataFrame()
    evaluation_matrix_list = []

    # iterate through each dataset
    for dataset in df_lists:
        dataset_name = get_wine_str(dataset)
        
        # iterate through each feature list 
        for feature_set in features:
            # name the pipeline
            pipeline_name = f"Pipeline_{dataset_name}"

            # create X and y sets
            y = dataset['quality']
            X = dataset[feature_set[dataset_name]]

            # train-test split
            X_train, X_test, y_train, y_test = train_test_split(
                X, 
                y, 
                test_size=test_size, 
                random_state=random_state
            )

            # GridSearch 
            for (estimator, params) in estimators:
                print(f"Evaluating {estimator}\n")

                # create a pipeline from the base steps list and estimator
                pipe = Pipeline(base_steps + [estimator])

                this_results, evaluation_matrix = evaluate_GS_model(
                    pipe=pipe,
                    pipeline_name=pipeline_name,
                    X_train=X_train,
                    y_train=y_train,
                    X_test=X_test,
                    y_test=y_test,
                    regression_params=params,
                    predict_test=True,
                    predict_train=True,
                    cv=StratifiedKFold(n_splits=5)
                )
                results_df = pd.concat([results_df, this_results], ignore_index=True)

                evaluation_matrix_list.append([evaluation_matrix])

    return pipe, results_df, evaluation_matrix_list

Testing pipeline on our datasets¶

In [190]:
pipe, results_df, evaluation_matrix_list = building_piplines(wine_quality_without_outliers_dfs)
Rating ('lr', LinearRegression()) 

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.113, test=-0.116) neg_median_absolute_error: (train=-0.113, test=-0.117) r2: (train=0.079, test=0.089) total time=   0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.113, test=-0.118) neg_median_absolute_error: (train=-0.112, test=-0.110) r2: (train=0.082, test=0.073) total time=   0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.112, test=-0.121) neg_median_absolute_error: (train=-0.113, test=-0.102) r2: (train=0.082, test=0.068) total time=   0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.114, test=-0.115) neg_median_absolute_error: (train=-0.113, test=-0.128) r2: (train=0.093, test=0.017) total time=   0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.115, test=-0.108) neg_median_absolute_error: (train=-0.116, test=-0.099) r2: (train=0.081, test=0.080) total time=   0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.951, test=-23.974) neg_median_absolute_error: (train=-4.962, test=-4.975) r2: (train=-193.909, test=-187.441) total time=   0.0s
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.951, test=-23.943) neg_median_absolute_error: (train=-4.967, test=-4.971) r2: (train=-193.906, test=-187.196) total time=   0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.959, test=-23.811) neg_median_absolute_error: (train=-4.969, test=-4.953) r2: (train=-195.144, test=-181.995) total time=   0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.943, test=-24.073) neg_median_absolute_error: (train=-4.965, test=-4.983) r2: (train=-189.907, test=-204.653) total time=   0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.945, test=-23.924) neg_median_absolute_error: (train=-4.966, test=-4.960) r2: (train=-189.919, test=-203.378) total time=   0.0s
Rating ('lr', LinearRegression()) 

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.115, test=-0.117) neg_median_absolute_error: (train=-0.120, test=-0.126) r2: (train=0.062, test=0.083) total time=   0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.115, test=-0.119) neg_median_absolute_error: (train=-0.121, test=-0.121) r2: (train=0.067, test=0.062) total time=   0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.114, test=-0.120) neg_median_absolute_error: (train=-0.122, test=-0.112) r2: (train=0.063, test=0.075) total time=   0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.115, test=-0.117) neg_median_absolute_error: (train=-0.117, test=-0.120) r2: (train=0.080, test=-0.003) total time=   0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.116, test=-0.112) neg_median_absolute_error: (train=-0.119, test=-0.115) r2: (train=0.072, test=0.042) total time=   0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.953, test=-23.958) neg_median_absolute_error: (train=-4.974, test=-4.978) r2: (train=-193.926, test=-187.314) total time=   0.0s
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.953, test=-23.975) neg_median_absolute_error: (train=-4.970, test=-4.982) r2: (train=-193.921, test=-187.447) total time=   0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.962, test=-23.856) neg_median_absolute_error: (train=-4.976, test=-4.965) r2: (train=-195.164, test=-182.338) total time=   0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.945, test=-24.016) neg_median_absolute_error: (train=-4.966, test=-4.979) r2: (train=-189.920, test=-204.165) total time=   0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.946, test=-23.933) neg_median_absolute_error: (train=-4.973, test=-4.960) r2: (train=-189.928, test=-203.455) total time=   0.0s
Rating ('lr', LinearRegression()) 

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.307, test=-0.316) neg_median_absolute_error: (train=-0.394, test=-0.403) r2: (train=0.119, test=0.101) total time=   0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.305, test=-0.321) neg_median_absolute_error: (train=-0.393, test=-0.402) r2: (train=0.122, test=0.087) total time=   0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.309, test=-0.306) neg_median_absolute_error: (train=-0.395, test=-0.395) r2: (train=0.112, test=0.130) total time=   0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.311, test=-0.299) neg_median_absolute_error: (train=-0.393, test=-0.374) r2: (train=0.114, test=0.120) total time=   0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.308, test=-0.307) neg_median_absolute_error: (train=-0.393, test=-0.410) r2: (train=0.116, test=0.116) total time=   0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.059, test=-41.015) neg_median_absolute_error: (train=-6.193, test=-6.187) r2: (train=-117.012, test=-115.627) total time=   0.0s
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.052, test=-41.138) neg_median_absolute_error: (train=-6.193, test=-6.188) r2: (train=-117.029, test=-115.828) total time=   0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.055, test=-41.067) neg_median_absolute_error: (train=-6.180, test=-6.193) r2: (train=-117.039, test=-115.628) total time=   0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.073, test=-41.279) neg_median_absolute_error: (train=-6.178, test=-6.195) r2: (train=-116.042, test=-120.540) total time=   0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.065, test=-40.801) neg_median_absolute_error: (train=-6.186, test=-6.158) r2: (train=-116.699, test=-116.318) total time=   0.0s
Rating ('lr', LinearRegression()) 

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.308, test=-0.313) neg_median_absolute_error: (train=-0.402, test=-0.405) r2: (train=0.114, test=0.110) total time=   0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.306, test=-0.322) neg_median_absolute_error: (train=-0.393, test=-0.402) r2: (train=0.121, test=0.084) total time=   0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.309, test=-0.308) neg_median_absolute_error: (train=-0.403, test=-0.398) r2: (train=0.111, test=0.125) total time=   0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.312, test=-0.299) neg_median_absolute_error: (train=-0.400, test=-0.386) r2: (train=0.112, test=0.118) total time=   0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.308, test=-0.311) neg_median_absolute_error: (train=-0.400, test=-0.416) r2: (train=0.116, test=0.105) total time=   0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.061, test=-40.973) neg_median_absolute_error: (train=-6.181, test=-6.165) r2: (train=-117.017, test=-115.508) total time=   0.0s
c:\Users\zosia\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_split.py:737: UserWarning: The least populated class in y has only 4 members, which is less than n_splits=5.
  warnings.warn(
c:\Users\zosia\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_split.py:737: UserWarning: The least populated class in y has only 4 members, which is less than n_splits=5.
  warnings.warn(
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.052, test=-41.174) neg_median_absolute_error: (train=-6.185, test=-6.183) r2: (train=-117.031, test=-115.931) total time=   0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.055, test=-41.141) neg_median_absolute_error: (train=-6.175, test=-6.189) r2: (train=-117.040, test=-115.838) total time=   0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.074, test=-41.249) neg_median_absolute_error: (train=-6.178, test=-6.190) r2: (train=-116.044, test=-120.454) total time=   0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.065, test=-40.775) neg_median_absolute_error: (train=-6.175, test=-6.164) r2: (train=-116.698, test=-116.245) total time=   0.0s
Rating ('lr', LinearRegression()) 

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.108, test=-0.109) neg_median_absolute_error: (train=-0.107, test=-0.098) r2: (train=0.074, test=0.168) total time=   0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.105, test=-0.120) neg_median_absolute_error: (train=-0.108, test=-0.105) r2: (train=0.098, test=0.085) total time=   0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.112, test=-0.092) neg_median_absolute_error: (train=-0.103, test=-0.115) r2: (train=0.096, test=0.097) total time=   0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.113, test=-0.089) neg_median_absolute_error: (train=-0.107, test=-0.086) r2: (train=0.092, test=0.119) total time=   0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.099, test=-0.150) neg_median_absolute_error: (train=-0.109, test=-0.104) r2: (train=0.153, test=-0.138) total time=   0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.103, test=-24.027) neg_median_absolute_error: (train=-4.979, test=-4.979) r2: (train=-205.805, test=-182.519) total time=   0.0s
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.101, test=-24.176) neg_median_absolute_error: (train=-4.978, test=-4.983) r2: (train=-205.781, test=-183.659) total time=   0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.068, test=-24.153) neg_median_absolute_error: (train=-4.970, test=-4.989) r2: (train=-193.265, test=-237.012) total time=   0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.069, test=-24.062) neg_median_absolute_error: (train=-4.980, test=-4.969) r2: (train=-193.269, test=-236.116) total time=   0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.096, test=-24.000) neg_median_absolute_error: (train=-4.965, test=-4.952) r2: (train=-206.141, test=-180.912) total time=   0.0s
Rating ('lr', LinearRegression()) 

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.108, test=-0.109) neg_median_absolute_error: (train=-0.108, test=-0.101) r2: (train=0.076, test=0.165) total time=   0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.105, test=-0.122) neg_median_absolute_error: (train=-0.108, test=-0.104) r2: (train=0.101, test=0.070) total time=   0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.112, test=-0.093) neg_median_absolute_error: (train=-0.104, test=-0.116) r2: (train=0.099, test=0.083) total time=   0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.112, test=-0.089) neg_median_absolute_error: (train=-0.107, test=-0.085) r2: (train=0.092, test=0.122) total time=   0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.099, test=-0.152) neg_median_absolute_error: (train=-0.105, test=-0.106) r2: (train=0.152, test=-0.151) total time=   0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.103, test=-24.045) neg_median_absolute_error: (train=-4.980, test=-4.979) r2: (train=-205.802, test=-182.654) total time=   0.0s
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.100, test=-24.131) neg_median_absolute_error: (train=-4.977, test=-4.984) r2: (train=-205.777, test=-183.316) total time=   0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.068, test=-24.181) neg_median_absolute_error: (train=-4.970, test=-4.982) r2: (train=-193.262, test=-237.280) total time=   0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.069, test=-24.052) neg_median_absolute_error: (train=-4.978, test=-4.968) r2: (train=-193.269, test=-236.016) total time=   0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.096, test=-24.007) neg_median_absolute_error: (train=-4.965, test=-4.958) r2: (train=-206.141, test=-180.964) total time=   0.0s
Rating ('lr', LinearRegression()) 

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.200, test=-0.212) neg_median_absolute_error: (train=-0.321, test=-0.284) r2: (train=0.206, test=0.126) total time=   0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.206, test=-0.187) neg_median_absolute_error: (train=-0.322, test=-0.325) r2: (train=0.183, test=0.230) total time=   0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.204, test=-0.192) neg_median_absolute_error: (train=-0.322, test=-0.271) r2: (train=0.189, test=0.207) total time=   0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.199, test=-0.217) neg_median_absolute_error: (train=-0.304, test=-0.371) r2: (train=0.193, test=0.177) total time=   0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.196, test=-0.231) neg_median_absolute_error: (train=-0.299, test=-0.324) r2: (train=0.209, test=0.111) total time=   0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.608, test=-39.506) neg_median_absolute_error: (train=-6.165, test=-6.181) r2: (train=-156.177, test=-162.140) total time=   0.0s
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.614, test=-39.401) neg_median_absolute_error: (train=-6.161, test=-6.145) r2: (train=-156.199, test=-161.706) total time=   0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.612, test=-39.814) neg_median_absolute_error: (train=-6.159, test=-6.177) r2: (train=-156.194, test=-163.413) total time=   0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.561, test=-39.274) neg_median_absolute_error: (train=-6.170, test=-6.129) r2: (train=-159.378, test=-148.131) total time=   0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.581, test=-39.966) neg_median_absolute_error: (train=-6.164, test=-6.186) r2: (train=-158.922, test=-152.619) total time=   0.0s
Rating ('lr', LinearRegression()) 

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.200, test=-0.210) neg_median_absolute_error: (train=-0.313, test=-0.292) r2: (train=0.207, test=0.132) total time=   0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.205, test=-0.187) neg_median_absolute_error: (train=-0.312, test=-0.326) r2: (train=0.187, test=0.229) total time=   0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.204, test=-0.191) neg_median_absolute_error: (train=-0.316, test=-0.266) r2: (train=0.192, test=0.211) total time=   0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.199, test=-0.214) neg_median_absolute_error: (train=-0.303, test=-0.361) r2: (train=0.195, test=0.187) total time=   0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.195, test=-0.231) neg_median_absolute_error: (train=-0.295, test=-0.313) r2: (train=0.213, test=0.111) total time=   0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.608, test=-39.494) neg_median_absolute_error: (train=-6.166, test=-6.185) r2: (train=-156.175, test=-162.092) total time=   0.0s
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.613, test=-39.405) neg_median_absolute_error: (train=-6.160, test=-6.151) r2: (train=-156.195, test=-161.723) total time=   0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.612, test=-39.811) neg_median_absolute_error: (train=-6.161, test=-6.169) r2: (train=-156.191, test=-163.399) total time=   0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.560, test=-39.312) neg_median_absolute_error: (train=-6.160, test=-6.119) r2: (train=-159.376, test=-148.273) total time=   0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.580, test=-39.955) neg_median_absolute_error: (train=-6.163, test=-6.188) r2: (train=-158.918, test=-152.576) total time=   0.0s

Results of the model:¶

In [191]:
results_df
Out[191]:
pipeline_name features coef train score mean_cv_score test_score best_model parameters mean_test_r2 mean_test_neg_mean_squared_error mean_test_neg_median_absolute_error
0 Pipeline_White Wine Poor (Without Outliers) [free sulfur dioxide, volatile acidity, total ... [0.009804424160145675, -0.059479220409756464, ... 0.113651 -0.115599 0.136677 (StandardScaler(), LinearRegression()) {'lr__fit_intercept': True} 0.065193 -0.115599 -0.111200
1 Pipeline_White Wine Poor (Without Outliers) [free sulfur dioxide, volatile acidity, residu... [0.037149606235627075, -0.05443671297011143, 0... 0.115442 -0.117185 0.136345 (StandardScaler(), LinearRegression()) {'lr__fit_intercept': True} 0.051781 -0.117185 -0.118854
2 Pipeline_White Wine Good (Without Outliers) [alcohol, density, chlorides, total sulfur dio... [0.02335745624729898, -0.33332421629431724, -0... 0.308168 -0.310040 0.326090 (StandardScaler(), LinearRegression()) {'lr__fit_intercept': True} 0.110858 -0.310040 -0.396855
3 Pipeline_White Wine Good (Without Outliers) [alcohol, chlorides, total sulfur dioxide, res... [0.20322606485574743, -0.03970062880296661, 0.... 0.308824 -0.310859 0.328447 (StandardScaler(), LinearRegression()) {'lr__fit_intercept': True} 0.108493 -0.310859 -0.401499
4 Pipeline_Red Wine Poor (Without Outliers) [volatile acidity, total sulfur dioxide, pH, f... [-0.0984315648917267, 0.06506705592099327, -0.... 0.107697 -0.111995 0.083575 (StandardScaler(), LinearRegression()) {'lr__fit_intercept': True} 0.066067 -0.111995 -0.101692
5 Pipeline_Red Wine Poor (Without Outliers) [volatile acidity, total sulfur dioxide, pH, c... [-0.09738634750601528, 0.06258791750627782, -0... 0.107612 -0.113002 0.082503 (StandardScaler(), LinearRegression()) {'lr__fit_intercept': True} 0.057932 -0.113002 -0.102416
6 Pipeline_Red Wine Good (Without Outliers) [alcohol, volatile acidity, sulphates, citric ... [0.12993460890703012, -0.030543707212247334, 0... 0.201659 -0.207719 0.147937 (StandardScaler(), LinearRegression()) {'lr__fit_intercept': True} 0.169862 -0.207719 -0.314700
7 Pipeline_Red Wine Good (Without Outliers) [alcohol, volatile acidity, sulphates, citric ... [0.14322209473124037, -0.03049374464085257, 0.... 0.200962 -0.206721 0.147073 (StandardScaler(), LinearRegression()) {'lr__fit_intercept': True} 0.173800 -0.206721 -0.311507

Discussion:¶

For each of datasets we identify the most suitable dataset:

White Poor Wine (N=1639) (1)

  • r2 on the test = 0.136345
  • Features: Volatile Acidity, Free Sulfur Dioxide, Residual Sugar, Alcohol, Total Sulfur Dioxide, Fixed Acidity, Density, Citric Acid.

White Good Wine (N=3249) (3)

  • r2 on the test = 0.328447
  • Features: Alcohol, Density, Chlorides, Total Sulfur Dioxide, Residual Sugar, pH, Fixed Acidity

Red Poor Wine (N=741) (4)

  • r2 on the test = 0.083575
  • Features: Volatile Acidity, Total Sulfur Dioxide, pH, Citric Acid, Alcohol, Sulphates, Density, Residual Sugar.

Red Good Wine (N=851) (6)

  • r2 on the test = 0.147937
  • Features: Alcohol, volatile acidity, sulphates, chlorides, total sulfur dioxide, residual sugar, pH

Comment: Overall, we can say that the models don't predict quality very much, r2 suggest that they describe between 13% to 32% of variance.

When it comes to Intercept and coeficients scores of the model there are presented below:¶

e.g. Pipeline_White Wine Poor (Without Outliers)

In [188]:
evaluation_matrix_list[1][0]
Out[188]:
df const Parameters Coefficients
0 Pipeline_White Wine Poor (Without Outliers) Pipeline_White Wine Poor (Without Outliers) Intercept 4.882263
1 Pipeline_White Wine Poor (Without Outliers) Pipeline_White Wine Poor (Without Outliers) free sulfur dioxide 0.037150
2 Pipeline_White Wine Poor (Without Outliers) Pipeline_White Wine Poor (Without Outliers) volatile acidity -0.054437
3 Pipeline_White Wine Poor (Without Outliers) Pipeline_White Wine Poor (Without Outliers) residual sugar 0.022813
4 Pipeline_White Wine Poor (Without Outliers) Pipeline_White Wine Poor (Without Outliers) alcohol -0.017102
5 Pipeline_White Wine Poor (Without Outliers) Pipeline_White Wine Poor (Without Outliers) fixed acidity -0.024791
6 Pipeline_White Wine Poor (Without Outliers) Pipeline_White Wine Poor (Without Outliers) citric acid 0.016143
7 Pipeline_White Wine Poor (Without Outliers) Pipeline_White Wine Poor (Without Outliers) chlorides -0.007053
8 Pipeline_White Wine Poor (Without Outliers) Pipeline_White Wine Poor (Without Outliers) sulphates 0.003767

Comment: based on the coeddicient we can say that the most importnat feature for White poor wine quality prediction is volatile acidity (|score| = 0.05). However, because the model doesn't provide statistical calcualtion od t value and p-value, we don't know if the score is significant.

In [ ]: